Abstract



The paper studies machine learning problems where each example is described using a 
set of Boolean features and where hypotheses are represented by linear threshold elements. 
One method of increasing the expressiveness of learned hypotheses in this context is to 
expand the feature set to include conjunctions of basic features. This can be done explicitly 
or where possible by using a kernel function. Focusing on the well known Perceptron 
and Winnow algorithms, the paper demonstrates a tradeoff between the computational 
efficiency with which the algorithm can be run over the expanded feature space and the 
generalization ability of the corresponding learning algorithm. 

We first describe several kernel functions which capture either limited forms of con- 
junctions or all conjunctions. We show that these kernels can be used to efficiently run 
the Perceptron algorithm over a feature space of exponentially many conjunctions; how- 
ever we also show that using such kernels, the Perceptron algorithm can provably make an 
exponential number of mistakes even when learning simple functions. 

We then consider the question of whether kernel functions can analogously be used 
to run the multiplicative-update Winnow algorithm over an expanded feature space of 
exponentially many conjunctions. Known upper bounds imply that the Winnow algorithm 
can learn Disjunctive Normal Form (DNF) formulae with a polynomial mistake bound in 
this setting. However, we prove that it is computationally hard to simulate Winnow's 
behavior for learning DNF over such a feature set. This implies that the kernel functions
which correspond to running Winnow for this problem are not efficiently computable, and 
that there is no general construction that can run Winnow with kernels. 

1. Introduction 

The problem of classifying objects into one of two classes being "positive" and "negative" 
examples of a concept is often studied in machine learning. The task in machine learning 
is to extract such a classifier from given pre-classified examples - the problem of learning 
from data. When each example is represented by a set of n numerical features, an example
can be seen as a point in Euclidean space $\mathbb{R}^n$. A common representation for classifiers in
this case is a hyperplane of dimension (n — 1) which splits the domain of examples into 
two areas of positive and negative examples. Such a representation is known as a linear 
threshold function, and many learning algorithms that output a hypothesis represented in 
this manner have been developed, analyzed, implemented, and applied in practice. Of 
particular interest in this paper are the well known Perceptron (Rosenblatt, 1958; Block,
1962; Novikoff, 1963) and Winnow (Littlestone, 1988) algorithms that have been intensively 
studied in the literature. 

It is also well known that the expressiveness of linear threshold functions is quite lim- 
ited (Minsky & Papert, 1968). Despite this fact, both Perceptron and Winnow have been 
applied successfully in recent years to several large scale real world classification problems. 
As one example, the SNoW system (Roth, 1998; Carlson, Cumby, Rosen, & Roth, 1999) has 
successfully applied variations of Perceptron and Winnow to problems in natural language 
processing. The SNoW system extracts basic Boolean features $x_1, \ldots, x_n$ from labeled pieces
of text data in order to represent the examples, thus the features have numerical values
restricted to $\{0, 1\}$. There are several ways to enhance the set of basic features $x_1, \ldots, x_n$
for Perceptron or Winnow. One idea is to expand the set of basic features $x_1, \ldots, x_n$ using
conjunctions such as $(x_1 \wedge x_3 \wedge x_4)$ and use these expanded higher-dimensional examples, in
which each conjunction plays the role of a basic feature, as the examples for Perceptron or 
Winnow. This is in fact the approach which the SNoW system takes running Perceptron or 
Winnow over a space of restricted conjunctions of these basic features. This idea is closely 
related to the use of kernel methods, see e.g. the book of Cristianini and Shawe-Taylor
(2000), where a feature expansion is done implicitly through the kernel function. The ap- 
proach clearly leads to an increase in expressiveness and thus may improve performance. 
However, it also dramatically increases the number of features (from n to $3^n$ if all
conjunctions are used), and thus may adversely affect both the computation time and convergence
rate of learning. The paper provides a theoretical study of the performance of Perceptron 
and Winnow when run over expanded feature spaces such as these. 

1.1 Background: On-Line Learning with Perceptron and Winnow 

Before describing our results, we recall some necessary background on the on-line learning 
model (Littlestone, 1988) and the Perceptron and Winnow algorithms. 

Given an instance space X of possible examples, a concept is a mapping of instances into 
one of two (or more) classes. A concept class $C \subseteq 2^X$ is simply a set of concepts. In on-line
learning a concept class C is fixed in advance and an adversary can pick a concept $c \in C$.
The learning is then modeled as a repeated game where in each iteration the adversary
picks an example $x \in X$, the learner gives a guess for the value of $c(x)$ and is then told the
correct value. We count one mistake for each iteration in which the value is not predicted
correctly. A learning algorithm learns a concept class C with mistake bound M if for any
choice of $c \in C$ and any (arbitrarily long) sequence of examples, the learner is guaranteed
to make at most M mistakes. 

In this paper we consider the case where the examples are given by Boolean features, 
that is $X = \{0, 1\}^n$, and we have two class labels denoted by $-1$ and $1$. Thus for $x \in \{0, 1\}^n$,
a labeled example $(x, 1)$ is a positive example, and a labeled example $(x, -1)$ is a negative
example. The concepts we consider are built using logical combinations of the n base
features and we are interested in mistake bounds that are polynomial in n. 

1.1.1 Perceptron 

Throughout its execution Perceptron maintains a weight vector $w \in \mathbb{R}^N$ which is initially
$(0, \ldots, 0)$. Upon receiving an example $x \in \mathbb{R}^N$ the algorithm predicts according to the
linear threshold function $w \cdot x \ge 0$. If the prediction is 1 and the label is $-1$ (false positive
prediction) then the vector w is set to $w - x$, while if the prediction is $-1$ and the label is 1
(false negative) then w is set to $w + x$. No change is made to w if the prediction is correct.
Many variants of this basic algorithm have been proposed and studied and in particular one
can add a nonzero threshold as well as a learning rate that controls the size of update to
w. Some of these are discussed further in Section 3.
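
The update rule just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `perceptron` and the list-based representation are our own choices, and ties ($w \cdot x = 0$) are assumed to be predicted as 1.

```python
from typing import Iterable, List, Sequence, Tuple


def perceptron(examples: Iterable[Tuple[Sequence[float], int]],
               dim: int) -> Tuple[List[float], int]:
    """Run the basic Perceptron over (x, label) pairs with labels in {-1, 1}.

    Returns the final weight vector and the number of mistakes made.
    """
    w = [0.0] * dim  # initial hypothesis is the all-zero vector
    mistakes = 0
    for x, label in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score >= 0 else -1  # assumption: ties predict 1
        if prediction != label:
            mistakes += 1
            # label = 1 (false negative) adds x; label = -1 (false positive) subtracts x
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

For instance, on the sequence `[((1, 0), -1), ((1, 1), 1), ((1, 0), -1)]` with `dim=2` every prediction is wrong, so the sketch updates on all three examples.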

The famous Perceptron Convergence Theorem (Rosenblatt, 1958; Block, 1962; Novikoff, 
1963) bounds the number of mistakes which the Perceptron algorithm can make: 

Theorem 1 Let $(x^1, y_1), \ldots, (x^t, y_t)$ be a sequence of labeled examples with $x^i \in \mathbb{R}^N$,
$\|x^i\| \le R$ and $y_i \in \{-1, 1\}$ for all i. Let $u \in \mathbb{R}^N$, $\xi > 0$ be such that
$y_i (u \cdot x^i) \ge \xi$ for all i. Then Perceptron makes at most $R^2 \|u\|^2 / \xi^2$ mistakes on this
example sequence.

1.1.2 Winnow 

The Winnow algorithm (Littlestone, 1988) has a very similar structure. Winnow maintains
a hypothesis vector $w \in \mathbb{R}^N$ which is initially $w = (1, \ldots, 1)$. Winnow is parameterized by
a promotion factor $\alpha > 1$ and a threshold $\theta > 0$; upon receiving an example $x \in \{0, 1\}^N$
Winnow predicts according to the threshold function $w \cdot x \ge \theta$. If the prediction is 1 and the
label is $-1$ then for all i such that $x_i = 1$ the value of $w_i$ is set to $w_i / \alpha$; this is a demotion
step. If the prediction is $-1$ and the label is 1 then for all i such that $x_i = 1$ the value of $w_i$
is set to $\alpha w_i$; this is a promotion step. No change is made to w if the prediction is correct.
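
For comparison with Perceptron, here is a minimal Python sketch of the multiplicative update. The function name `winnow` and the default choices $\alpha = 2$, $\theta = N$ are illustrative assumptions, and the prediction is taken to be 1 exactly when $w \cdot x \ge \theta$.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def winnow(examples: Iterable[Tuple[Sequence[int], int]], dim: int,
           alpha: float = 2.0,
           theta: Optional[float] = None) -> Tuple[List[float], int]:
    """Run Winnow(alpha, theta) over (x, label) pairs, x in {0,1}^dim, label in {-1, 1}."""
    if theta is None:
        theta = float(dim)  # assumption: a common default, theta = N
    w = [1.0] * dim  # initial hypothesis is the all-ones vector
    mistakes = 0
    for x, label in examples:
        score = sum(wi for wi, xi in zip(w, x) if xi == 1)
        prediction = 1 if score >= theta else -1
        if prediction != label:
            mistakes += 1
            # promotion multiplies active weights by alpha, demotion divides them
            factor = alpha if label == 1 else 1.0 / alpha
            w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```

Only the weights of features that are active in a misclassified example change, which is what makes simulating Winnow over an implicit feature space harder than simulating Perceptron.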

For our purposes the following mistake bound, implicit in Littlestone's work (1988), is 
of interest: 

Theorem 2 Let the target function be a k-literal monotone disjunction $f(x_1, \ldots, x_N) =
x_{i_1} \vee \cdots \vee x_{i_k}$. For any sequence of examples in $\{0, 1\}^N$ labeled according to f the number
of prediction mistakes made by Winnow$(\alpha, \theta)$ is at most
$\frac{\alpha}{\alpha - 1} \cdot \frac{N}{\theta} + k(\alpha + 1)(1 + \log_\alpha \theta)$.

1.2 Our Results 

We are interested in the computational efficiency and convergence of the Perceptron and
Winnow algorithms when run over expanded feature spaces of conjunctions. Specifically,
we study the use of kernel functions to expand the feature space and thus enhance the 
learning abilities of Perceptron and Winnow; we refer to these enhanced algorithms as 
kernel Perceptron and kernel Winnow. 

Our first result (cf. also the papers of Sadohara, 2001; Watkins, 1999; and Kowalczyk
et al., 2001) uses kernel functions to show that it is possible to efficiently run the kernel 
Perceptron algorithm over an exponential number of conjunctive features.

Result 1: (see Theorem 3) There is an algorithm that simulates Perceptron over the
$3^n$-dimensional feature space of all conjunctions of n basic features. Given a sequence of t
labeled examples in $\{0, 1\}^n$ the prediction and update for each example take poly(n, t) time
steps. We also prove variants of this result in which the expanded feature space consists of 
all monotone conjunctions or all conjunctions of some bounded size. 

This result is closely related to one of the main open problems in learning theory: 
efficient learnability of disjunctions of conjunctions, or DNF (Disjunctive Normal Form)
expressions (see footnote 1). Since linear threshold elements can represent disjunctions
(e.g. $x_1 \vee x_2 \vee x_3$ is true iff $x_1 + x_2 + x_3 \ge 1$), Theorem 1 and Result 1 imply that
kernel Perceptron can be used to learn DNF. However, in this framework the values of N and R
in Theorem 1 can be exponentially large (note that we have $N = 3^n$ and $R = 2^{n/2}$ if all
conjunctions are used),
and hence the mistake bound given by Theorem 1 is exponential rather than polynomial 
in n. The question thus arises whether the exponential upper bound implied by Theorem 
1 is essentially tight for the kernel Perceptron algorithm in the context of DNF learning. 
We give an affirmative answer, thus showing that kernel Perceptron cannot efficiently learn 
DNF. 

Result 2: There is a monotone DNF f over $x_1, \ldots, x_n$ and a sequence of examples labeled
according to f which causes the kernel Perceptron algorithm to make $2^{\Omega(n)}$ mistakes. This
result holds for generalized versions of the Perceptron algorithm where a fixed or updated 
threshold and a learning rate are used. We also give a variant of this result showing 
that kernel Perceptron fails in the Probably Approximately Correct (PAC) learning model 
(Valiant, 1984) as well. 

Turning to Winnow, an attractive feature of Theorem 2 is that for suitable $\alpha, \theta$ the bound
is logarithmic in the total number of features N (e.g. $\alpha = 2$ and $\theta = N$). Therefore, as
noted by several researchers (Maass & Warmuth, 1998), if a Winnow analogue of Theorem 3 
could be obtained this would imply that DNF can be learned by a computationally efficient 
algorithm with a poly(n)-mistake bound. However, we give strong evidence that no such 
Winnow analogue of Theorem 3 can exist. 
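
To make the "logarithmic in N" remark concrete, the bound of Theorem 2 can be evaluated directly. The helper below is a hypothetical illustration (the name `winnow_mistake_bound` is ours), computing $\frac{\alpha}{\alpha-1}\cdot\frac{N}{\theta} + k(\alpha+1)(1+\log_\alpha\theta)$.

```python
import math


def winnow_mistake_bound(k: int, num_features: int,
                         alpha: float, theta: float) -> float:
    """Evaluate the Theorem 2 mistake bound for a k-literal monotone disjunction."""
    first_term = alpha / (alpha - 1.0) * num_features / theta
    log_alpha_theta = math.log(theta) / math.log(alpha)
    return first_term + k * (alpha + 1.0) * (1.0 + log_alpha_theta)
```

With $\alpha = 2$ and $\theta = N$ the first term is 2 and the bound becomes $2 + 3k(1 + \log_2 N)$. For $N = 3^n$ features this equals $2 + 3k(1 + n \log_2 3)$, which is linear in n for fixed k even though N is exponential.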

Result 3: There is no polynomial time algorithm which simulates Winnow over exponen- 
tially many monotone conjunctive features for learning monotone DNF unless every problem 
in the complexity class #P can be solved in polynomial time. This result holds for a wide 
range of parameter settings in the Winnow algorithm. 

We observe that, in contrast to this negative result, Maass and Warmuth have shown 
that the Winnow algorithm can be simulated efficiently over exponentially many conjunctive 
features for learning some simple geometric concept classes (Maass & Warmuth, 1998). 

Our results thus indicate a tradeoff between computational efficiency and convergence
of kernel algorithms for rich classes of Boolean functions such as DNF formulas; the kernel
Perceptron algorithm is computationally efficient to run but has exponentially slow
convergence, whereas kernel Winnow has rapid convergence but seems to require exponential
runtime.

Footnote 1: Angluin (1990) proved that DNF expressions cannot be learned efficiently using equivalence queries
whose hypotheses are themselves DNF expressions. Since the model of exact learning from equivalence
queries only is equivalent to the mistake bound model which we consider in this paper, her result implies
that no online algorithm which uses DNF formulas as hypotheses can efficiently learn DNF. However,
this result does not preclude the efficient learnability of DNF using a different class of hypotheses. The
kernel Perceptron algorithm generates hypotheses which are thresholds of conjunctions rather than DNF
formulas, and thus Angluin's negative results do not apply here.

2. Kernel Perceptron with Many Features 

It is well known that the hypothesis w of the Perceptron algorithm is a linear combination 
of the previous examples on which mistakes were made (Cristianini & Shawe-Taylor, 2000).

More precisely, if we let $L(v) \in \{-1, 1\}$ denote the label of example v, then we have that
$w = \sum_{v \in M} L(v) v$ where M is the set of examples on which the algorithm made a mistake.
Thus the prediction of Perceptron on x is 1 iff
$w \cdot x = (\sum_{v \in M} L(v) v) \cdot x = \sum_{v \in M} L(v)(v \cdot x) \ge 0$.

For an example $x \in \{0, 1\}^n$ let $\phi(x)$ denote its transformation into an enhanced feature
space such as the space of all conjunctions. To run the Perceptron algorithm over the
enhanced space we must predict 1 iff $w^\phi \cdot \phi(x) \ge 0$ where $w^\phi$ is the weight vector in
the enhanced space; from the above discussion this holds iff
$\sum_{v \in M} L(v)(\phi(v) \cdot \phi(x)) \ge 0$.
Denoting $K(v, x) = \phi(v) \cdot \phi(x)$ this holds iff $\sum_{v \in M} L(v) K(v, x) \ge 0$.

Thus we never need to construct the enhanced feature space explicitly; in order to run 
Perceptron we need only be able to compute the kernel function $K(v, x)$ efficiently. This is
the idea behind all so-called kernel methods, which can be applied to any algorithm (such 
as support vector machines) whose prediction is a function of inner products of examples. 
A more detailed discussion is given in the book of Cristianini and Shawe-Taylor (2000).
Thus the next theorem is simply obtained by presenting a kernel function capturing all 
conjunctions. 
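
The dual form described above can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the name `kernel_perceptron` is ours, the kernel is passed in as an arbitrary Python callable, and ties are assumed to be predicted as 1.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Example = Sequence[int]


def kernel_perceptron(examples: Iterable[Tuple[Example, int]],
                      kernel: Callable[[Example, Example], float]
                      ) -> List[Tuple[Example, int]]:
    """Run Perceptron implicitly over the feature space defined by `kernel`.

    The hypothesis is stored as the list M of (example, label) pairs on which
    a mistake was made; this list is returned.
    """
    mistake_set: List[Tuple[Example, int]] = []
    for x, label in examples:
        # w . phi(x) = sum over v in M of L(v) * K(v, x)
        score = sum(lv * kernel(v, x) for v, lv in mistake_set)
        prediction = 1 if score >= 0 else -1
        if prediction != label:
            mistake_set.append((x, label))
    return mistake_set


def dot(v: Example, x: Example) -> int:
    """Plain inner product; with this kernel the sketch is ordinary Perceptron."""
    return sum(a * b for a, b in zip(v, x))
```

Each prediction costs one kernel evaluation per stored mistake, so the running time per example is polynomial in n and t whenever the kernel itself is computable in time polynomial in n.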

Theorem 3 There is an algorithm that simulates Perceptron over the feature spaces of
(1) all conjunctions, (2) all monotone conjunctions, (3) conjunctions of size $\le k$, and (4)
monotone conjunctions of size $\le k$. Given a sequence of t labeled examples in $\{0, 1\}^n$ the
prediction and update for each example take poly(n, t) time steps.

Proof: For case (1) $\phi(\cdot)$ includes all $3^n$ conjunctions (with positive and negative literals) and
$K(x, y)$ must compute the number of conjunctions which are true in both x and y. Clearly,
any literal in such a conjunction must satisfy both x and y and thus the corresponding bit
in x, y must have the same value. Thus each conjunction true in both x and y corresponds
to a subset of such bits. Counting all these conjunctions gives $K(x, y) = 2^{same(x, y)}$ where
$same(x, y)$ is the number of original features that have the same value in x and y, i.e. the
number of bit positions i which have $x_i = y_i$. This kernel has been obtained independently
by Sadohara (2001).

To express all monotone monomials as in (2) we take $K(x, y) = 2^{|x \cap y|}$ where $|x \cap y|$ is
the number of active features common to both x and y, i.e. the number of bit positions
which have $x_i = y_i = 1$.

Similarly, for case (3) the number of conjunctions that satisfy both x and y is
$K(x, y) = \sum_{l=0}^{k} \binom{same(x, y)}{l}$. This kernel is reported also by Watkins (1999).
For case (4) we have $K(x, y) = \sum_{l=0}^{k} \binom{|x \cap y|}{l}$. □
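
The four kernels in the proof can be written down directly and checked against explicit enumeration on small inputs. The sketch below is illustrative only; all function names are ours, and the brute-force counter exists purely to confirm the closed forms for small n.

```python
from itertools import product
from math import comb


def same(x, y):
    """Number of bit positions where x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b)


def common_ones(x, y):
    """Number of bit positions where x and y are both 1, i.e. |x ∩ y|."""
    return sum(1 for a, b in zip(x, y) if a == 1 and b == 1)


def kernel_all(x, y):                      # case (1)
    return 2 ** same(x, y)


def kernel_monotone(x, y):                 # case (2)
    return 2 ** common_ones(x, y)


def kernel_all_bounded(x, y, k):           # case (3)
    return sum(comb(same(x, y), l) for l in range(k + 1))


def kernel_monotone_bounded(x, y, k):      # case (4)
    return sum(comb(common_ones(x, y), l) for l in range(k + 1))


def satisfies(conj, z):
    """conj[i] is None (variable absent), 1 (positive literal) or 0 (negated)."""
    return all(c is None or z[i] == c for i, c in enumerate(conj))


def brute_force(x, y, monotone=False, max_size=None):
    """Count conjunctions true in both x and y by enumerating all of them."""
    options = (None, 1) if monotone else (None, 1, 0)
    count = 0
    for conj in product(options, repeat=len(x)):
        size = sum(c is not None for c in conj)
        if max_size is not None and size > max_size:
            continue
        if satisfies(conj, x) and satisfies(conj, y):
            count += 1
    return count
```

The enumeration takes time exponential in n, whereas each kernel needs only one pass over the n bits (plus at most k + 1 binomial coefficients), which is the source of the efficiency claimed in Theorem 3.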

3. Kernel Perceptron with Many Mistakes

In this section we describe a simple monotone DNF target function and a sequence of 
labeled examples which causes the monotone monomials kernel Perceptron algorithm to 
make exponentially many mistakes. 

For $x, y \in \{0, 1\}^n$ we write $|x|$ to denote the number of 1's in x and, as described above,
$|x \cap y|$ to denote the number of bit positions i which have $x_i = y_i = 1$. We need the following
well-known tail bound on sums of independent random variables which can be found in,
e.g., Section 9.3 of the book by Kearns and Vazirani (1994):

Fact 4 Let $X_1, \ldots, X_m$ be a sequence of m independent 0/1-valued random variables, each
of which has $E[X_i] = p$. Let X denote $\sum_{i=1}^{m} X_i$, so $E[X] = pm$. Then for $0 \le \gamma \le 1$, we
have

$\Pr[X > (1 + \gamma)pm] \le e^{-mp\gamma^2/3}$ and $\Pr[X < (1 - \gamma)pm] \le e^{-mp\gamma^2/2}$.

We also use the following combinatorial property:

Lemma 5 There is a set S of n-bit strings $S = \{x^1, \ldots, x^t\} \subset \{0, 1\}^n$ with $t = e^{n/9600}$
such that $|x^i| = n/20$ for $1 \le i \le t$ and $|x^i \cap x^j| \le n/80$ for $1 \le i < j \le t$.

Proof: We use the probabilistic method. For each $i = 1, \ldots, t$ let $x^i \in \{0, 1\}^n$ be chosen
by independently setting each bit to 1 with probability 1/10. For any i it is clear that
$E[|x^i|] = n/10$. Applying Fact 4, we have that $\Pr[|x^i| < n/20] \le e^{-n/80}$, and thus the
probability that any $x^i$ satisfies $|x^i| < n/20$ is at most $t e^{-n/80}$. Similarly, for any $i \neq j$ we
have $E[|x^i \cap x^j|] = n/100$. Applying Fact 4 we have that $\Pr[|x^i \cap x^j| > n/80] \le e^{-n/4800}$,
and thus the probability that any $x^i, x^j$ with $i \neq j$ satisfies $|x^i \cap x^j| > n/80$ is at most
$\binom{t}{2} e^{-n/4800}$. For $t = e^{n/9600}$ the value of $\binom{t}{2} e^{-n/4800} + t e^{-n/80}$ is less than 1.
Thus for some choice of $x^1, \ldots, x^t$ we have each $|x^i| \ge n/20$ and $|x^i \cap x^j| \le n/80$. For any
$x^i$ which has $|x^i| > n/20$ we can set $|x^i| - n/20$ of the 1s to 0s, and the lemma is proved. □

Now using the previous lemma we can construct a difficult data set for kernel Perceptron: 

Theorem 6 There is a monotone DNF f over $x_1, \ldots, x_n$ and a sequence of examples labeled
according to f which causes the kernel Perceptron algorithm to make $2^{\Omega(n)}$ mistakes.

Proof: The target DNF which we will use is very simple: it is the single conjunction
$x_1 x_2 \cdots x_n$. While the original Perceptron algorithm over the n features $x_1, \ldots, x_n$ is easily
seen to make at most poly(n) mistakes for this target function, we now show that the
monotone kernel Perceptron algorithm which runs over a feature space of all $2^n$ monotone
monomials can make $2 + e^{n/9600}$ mistakes.

Recall that at the beginning of the Perceptron algorithm's execution all $2^n$ coordinates
of $w^\phi$ are 0. The first example is the negative example $0^n$. The only monomial true in this
example is the empty monomial which is true in every example. Since $w^\phi \cdot \phi(x) = 0$,
Perceptron incorrectly predicts 1 on this example. The resulting update causes the coefficient
$w^\phi_\emptyset$ corresponding to the empty monomial to become $-1$ but all $2^n - 1$ other coordinates
of $w^\phi$ remain 0. The next example is the positive example $1^n$. For this example we have
$w^\phi \cdot \phi(x) = -1$ so Perceptron incorrectly predicts $-1$. Since all $2^n$ monotone conjunctions
are satisfied by this example the resulting update causes $w^\phi_\emptyset$ to become 0 and all $2^n - 1$
other coordinates of $w^\phi$ to become 1. The next $e^{n/9600}$ examples are the vectors $x^1, \ldots, x^t$
described in Lemma 5. Since each such example has $|x^i| = n/20$ each example is negative;
however as we now show the Perceptron algorithm will predict 1 on each of these examples.

Fix any value $1 \le i \le e^{n/9600}$ and consider the hypothesis vector $w^\phi$ just before example
$x^i$ is received. Since $|x^i| = n/20$ the value of $w^\phi \cdot \phi(x^i)$ is a sum of the $2^{n/20}$ different
coordinates $w^\phi_T$ which correspond to the monomials satisfied by $x^i$. More precisely we have
$w^\phi \cdot \phi(x^i) = \sum_{T \in A_i} w^\phi_T + \sum_{T \in B_i} w^\phi_T$ where $A_i$ contains the monomials which are satisfied
by $x^i$ and $x^j$ for some $j \neq i$ and $B_i$ contains the monomials which are satisfied by $x^i$ but
no $x^j$ with $j \neq i$. We lower bound the two sums separately.

Let T be any monomial in $A_i$. By Lemma 5 any $T \in A_i$ contains at most n/80 variables
and thus there can be at most $\sum_{r=0}^{n/80} \binom{n/20}{r}$ monomials in $A_i$. Using the well known bound
$\sum_{j=0}^{\alpha \ell} \binom{\ell}{j} = 2^{(H(\alpha) + o(1))\ell}$ where $0 < \alpha \le 1/2$ and $H(p) = -p \log p - (1 - p) \log(1 - p)$ is
the binary entropy function, which can be found e.g. as Theorem 1.4.5 of the book by
Van Lint (1992), there can be at most $2^{0.8113 \cdot (n/20) + o(n)} < 2^{0.041n}$ terms in $A_i$. Moreover
the value of each $w^\phi_T$ must be at least $-e^{n/9600}$ since $w^\phi_T$ decreases by at most 1 for each
example, and hence $\sum_{T \in A_i} w^\phi_T \ge -e^{n/9600} 2^{0.041n} > -2^{0.042n}$. On the other hand, any $T \in B_i$
is false in all other examples and therefore $w^\phi_T$ has not been demoted and $w^\phi_T = 1$. By
Lemma 5 for any $r > n/80$ every r-variable monomial satisfied by $x^i$ must belong to $B_i$,
and hence $\sum_{T \in B_i} w^\phi_T \ge \sum_{r=n/80+1}^{n/20} \binom{n/20}{r} > 2^{0.049n}$. Combining these inequalities we have
$w^\phi \cdot \phi(x^i) \ge -2^{0.042n} + 2^{0.049n} > 0$ and hence the Perceptron prediction on $x^i$ is 1. □
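
The two counting bounds in this proof can be sanity-checked numerically for concrete n. The sketch below is only an illustration (the function name is ours): it computes, in exact integer arithmetic, the number of monomials of size at most n/80 and of size greater than n/80 over n/20 variables, and returns their base-2 logarithms. Since the stated bounds are asymptotic, they should be expected to hold only once n is reasonably large.

```python
import math


def theorem6_log_counts(n: int):
    """Return (log2 of the bound on |A_i|, log2 of the lower bound on the B_i sum).

    Assumes n is a multiple of 80 so that n/20 and n/80 are integers.
    """
    ones, overlap = n // 20, n // 80
    small = sum(math.comb(ones, r) for r in range(overlap + 1))
    large = sum(math.comb(ones, r) for r in range(overlap + 1, ones + 1))
    return math.log2(small), math.log2(large)
```

The first value can be compared with 0.041n and the second with 0.049n; the gap between the two exponents is what forces the prediction on each $x^i$ to be 1.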

Remark 7 At first sight it might seem that the result is limited to a simple special case of
the Perceptron algorithm. Several variations exist that use: an added feature with a fixed
value that enables the algorithm to update the threshold indirectly (via a weight $\tilde{w}$), a
nonzero fixed (initial) threshold $\theta$, and a learning rate $\alpha$, and in particular all these three can
be used simultaneously. The generalized algorithm predicts according to the hypothesis
$w \cdot x + \tilde{w} \ge \theta$ and updates $w \leftarrow w + \alpha x$ and $\tilde{w} \leftarrow \tilde{w} + \alpha$ for promotions and similarly
for demotions. We show here that exponential lower bounds on the number of mistakes
can be derived for the more general algorithm as well. First, note that since our kernel
includes a feature for the empty monomial which is always true, the first parameter is
already accounted for. For the other two parameters note that there is a degree of freedom
between the learning rate $\alpha$ and fixed threshold $\theta$ since multiplying both by the same factor
does not change the hypothesis and therefore it suffices to consider the threshold only. We
consider several cases for the value of the threshold. If $\theta$ satisfies $0 \le \theta \le 2^{0.047n}$ then we
use the same sequence of examples. After the first two examples the algorithm makes a
promotion on $1^n$ (it may or may not update on $0^n$ but that is not important). For the
examples $x^i$ in the sequence the bounds on $\sum_{T \in A_i} w^\phi_T$ and $\sum_{T \in B_i} w^\phi_T$ are still valid so the
final inequality in the proof becomes $w \cdot x^i \ge -2^{0.042n} + 2^{0.049n} > 2^{0.047n}$ which is true for
sufficiently large n. If $\theta > 2^{0.047n}$ then we can construct the following scenario. We use the
function $f = x_1 \vee x_2 \vee \ldots \vee x_n$, and the sequence of examples includes $\lceil \theta/2 \rceil - 1$ repetitions of
the same example x where the first bit is 1 and all other bits are 0. The example x satisfies
exactly 2 monomials and therefore the algorithm will make mistakes on all the examples in
the sequence. If $\theta < 0$ then the initial hypothesis misclassifies $0^n$. We start the example
sequence by repeating the example $0^n$ until it is classified correctly, that is $\lceil -\theta \rceil$ times.
If the threshold is large in absolute value e.g. $\theta < -2^{0.042n}$ we are done. Otherwise we
continue with the example $1^n$. Since all weights except for the empty monomial are zero at
this stage the examples $0^n$ and $1^n$ are classified in the same way so $1^n$ is misclassified and
therefore the algorithm makes a promotion. The argument for the rest of the sequence is as
above (except for adding a term for the empty monomial) and the final inequality becomes
$w \cdot x^i \ge -2^{0.042n} - 2^{0.042n} + 2^{0.049n} > -2^{0.042n}$ so each of the examples is misclassified. Thus
in all cases kernel Perceptron may make an exponential number of mistakes.

3.1 A Negative Result for the PAC Model

The proof above can be adapted to give a negative result for kernel Perceptron in the PAC 
learning model (Valiant, 1984). In this model each example x is independently drawn from 
a fixed probability distribution $\mathcal{D}$ and with high probability the learner must construct a
hypothesis h which has high accuracy relative to the target concept c under distribution $\mathcal{D}$.
See the Kearns-Vazirani text (1994) for a detailed discussion of the PAC learning model. 

Let $\mathcal{D}$ be the probability distribution over $\{0, 1\}^n$ which assigns weight 1/4 to the
example $0^n$, weight 1/4 to the example $1^n$, and weight $\frac{1}{2 e^{n/9600}}$ to each of the $e^{n/9600}$
examples $x^1, \ldots, x^t$.

Theorem 8 If kernel Perceptron is run using a sample of polynomial size $p(n)$ then with
probability at least 1/16 the error of its final hypothesis is at least 0.49.

Proof: With probability 1/16, the first two examples received from $\mathcal{D}$ will be $0^n$ and then
$1^n$. Thus, with probability 1/16, after two examples (as in the proof above) the Perceptron
algorithm will have $w^\phi_\emptyset = 0$ and all other coefficients of $w^\phi$ equal to 1.

Consider the sequence of examples following these two examples. First note that in any
trial, any occurrence of an example other than $1^n$ (i.e. any occurrence either of some $x^i$ or of
the $0^n$ example) can decrease $\sum_{T \subseteq [n]} w^\phi_T$ by at most $2^{n/20}$. Since after the first two examples
we have $w^\phi \cdot \phi(1^n) = \sum_{T \subseteq [n]} w^\phi_T = 2^n - 1$, it follows that at least $2^{19n/20} - 1$ more examples
must occur before the $1^n$ example will be incorrectly classified as a negative example. Since
we will only consider the performance of the algorithm for $p(n) < 2^{19n/20} - 1$ steps, we
may ignore all subsequent occurrences of $1^n$ since they will not change the algorithm's
hypothesis.

Now observe that on the first example which is not $1^n$ the algorithm will perform a
demotion resulting in $w^\phi_\emptyset = -1$ (possibly changing other coefficients as well). Since no
promotions will be performed on the rest of the sample, we get $w^\phi_\emptyset \le -1$ for the rest of
the learning process. It follows that all future occurrences of the example $0^n$ are correctly
classified and thus we may ignore them as well.

Considering examples $x^i$ from the sequence constructed above, we may ignore any
example that is correctly classified since no update is made on it. It follows that when the
Perceptron algorithm has gone over all examples, its hypothesis is formed by demotions on
examples in the sequence of $x^i$'s. The only difference from the scenario above is that the
algorithm may make several demotions on the same example if it occurs multiple times in
the sample. However, an inspection of the proof above shows that for any $x^i$ that has not
been seen by the algorithm, the bounds on $\sum_{T \in A_i} w^\phi_T$ and $\sum_{T \in B_i} w^\phi_T$ are still valid and
therefore $x^i$ will be misclassified. Since the sample is of size $p(n)$ and the sequence is of
size $e^{n/9600}$ the probability weight of examples in the sample is at most 0.01 for sufficiently
large n so the error of the hypothesis is at least 0.49. □

4. Computational Hardness of Kernel Winnow 

In this section, for $x \in \{0, 1\}^n$ we let $\phi(x)$ denote the $(2^n - 1)$-element vector whose
coordinates are all nonempty monomials (monotone conjunctions) over $x_1, \ldots, x_n$. We say that
a sequence of labeled examples $(x^1, b_1), \ldots, (x^t, b_t)$ is monotone consistent if it is consistent
with some monotone function, i.e. $x^i_k \le x^j_k$ for all $k = 1, \ldots, n$ implies $b_i \le b_j$. If S is
monotone consistent and has t labeled examples then clearly there is a monotone DNF
formula consistent with S which contains at most t conjunctions. We consider the following
problem:

KERNEL WINNOW PREDICTION$(\alpha, \theta)$ (KWP)

Instance: Monotone consistent sequence $S = (x^1, b_1), \ldots, (x^t, b_t)$ of labeled examples with
each $x^i \in \{0, 1\}^m$ and each $b_i \in \{-1, 1\}$; unlabeled example $z \in \{0, 1\}^m$.

Question: Is $w^\phi \cdot \phi(z) \ge \theta$, where $w^\phi$ is the $N = (2^m - 1)$-dimensional hypothesis vector
generated by running Winnow$(\alpha, \theta)$ on the example sequence $(\phi(x^1), b_1), \ldots, (\phi(x^t), b_t)$?

In order to run Winnow over all $2^m - 1$ nonempty monomials to learn monotone DNF,
one must be able to solve KWP efficiently. Our main result in this section is a proof 
that KWP is computationally hard for a wide range of parameter settings which yield a 
polynomial mistake bound for Winnow via Theorem 2. 
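
To make the KWP question concrete, the following sketch answers it by explicitly maintaining one weight per nonempty monomial. It is a naive illustration under our own naming, takes time exponential in m, and is usable only for very small m; the hardness result below concerns whether this can be done in time polynomial in m.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def kwp_brute_force(examples: List[Tuple[Sequence[int], int]],
                    z: Sequence[int], alpha: float, theta: float) -> bool:
    """Answer KWP(alpha, theta) by running Winnow over all 2^m - 1 monomials."""
    m = len(z)
    # each monomial is the set of variable indices it contains
    monomials = [frozenset(c) for r in range(1, m + 1)
                 for c in combinations(range(m), r)]
    w = {T: 1.0 for T in monomials}

    def satisfied(x):
        return [T for T in monomials if all(x[i] == 1 for i in T)]

    for x, label in examples:
        active = satisfied(x)
        prediction = 1 if sum(w[T] for T in active) >= theta else -1
        if prediction != label:
            factor = alpha if label == 1 else 1.0 / alpha
            for T in active:
                w[T] *= factor
    return sum(w[T] for T in satisfied(z)) >= theta
```

Unlike the Perceptron case, the weight of a monomial here depends multiplicatively on which misclassified examples satisfy it, so the final inner product is not a simple sum of pairwise kernel values.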

Recall that #P is the class of all counting problems associated with NP decision prob- 
lems; it is well known that if every function in #P is computable in polynomial time then 
P = NP. See the book of Papadimitriou (1994) or the paper of Valiant (1979) for details 
on #P. The following problem is #P-hard (Valiant, 1979): 

MONOTONE 2-SAT (M2SAT)

Instance: Monotone 2-CNF Boolean formula $F = c_1 \wedge c_2 \wedge \ldots \wedge c_r$ with $c_i = (y_{i_1} \vee y_{i_2})$
and each $y_{i_j} \in \{y_1, \ldots, y_n\}$; integer K such that $1 \le K \le 2^n$.

Question: Is $|F^{-1}(1)| \ge K$, i.e. does F have at least K satisfying assignments in $\{0, 1\}^n$?
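
For small n the M2SAT question can be answered by direct enumeration, which may help fix the definition. The sketch below is illustrative only (function names and the clause encoding as pairs of 0-based variable indices are our assumptions) and runs in time exponential in n.

```python
from itertools import product
from typing import Iterable, Tuple


def count_satisfying(n: int, clauses: Iterable[Tuple[int, int]]) -> int:
    """Count assignments in {0,1}^n satisfying a monotone 2-CNF.

    Each clause (i, j) stands for (y_i OR y_j), with 0-based indices.
    """
    clause_list = list(clauses)
    return sum(1 for y in product((0, 1), repeat=n)
               if all(y[i] == 1 or y[j] == 1 for i, j in clause_list))


def m2sat(n: int, clauses: Iterable[Tuple[int, int]], K: int) -> bool:
    """Decide whether the formula has at least K satisfying assignments."""
    return count_satisfying(n, clauses) >= K
```

For example, $(y_1 \vee y_2) \wedge (y_2 \vee y_3)$ has five satisfying assignments: the four with $y_2 = 1$, plus the single assignment with $y_2 = 0$ and $y_1 = y_3 = 1$.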

Theorem 9 Fix any $\epsilon > 0$. Let $N = 2^m - 1$, let $\alpha \ge 1 + 1/m^{1-\epsilon}$, and let $\theta \ge 1$ be such
that $\max(\frac{\alpha}{\alpha - 1} \cdot \frac{N}{\theta}, (\alpha + 1)(1 + \log_\alpha \theta)) = poly(m)$. If there is a polynomial time algorithm
for KWP$(\alpha, \theta)$, then every function in #P is computable in polynomial time.

Proof: For N, $\alpha$ and $\theta$ as described in the theorem a routine calculation shows that

$1 + 1/m^{1-\epsilon} \le \alpha \le poly(m)$ and $\frac{2^m}{poly(m)} \le \theta \le 2^{poly(m)}$. (1)

The proof is a reduction from the problem M2SAT. The high level idea of the proof is
simple: let (F, K) be an instance of M2SAT where F is defined over variables $y_1, \ldots, y_n$. The
Winnow algorithm maintains a weight $w^\phi_T$ for each monomial T over variables $x_1, \ldots, x_n$. We
define a 1-1 correspondence between these monomials T and truth assignments $y^T \in \{0, 1\}^n$
for F, and we give a sequence of examples for Winnow which causes $w^\phi_T \approx 0$ if $F(y^T) = 0$
and $w^\phi_T = 1$ if $F(y^T) = 1$. The value of $w^\phi \cdot \phi(z)$ is thus related to $|F^{-1}(1)|$. Note that
if we could control $\theta$ as well this would be sufficient since we could use $\theta = K$ and the
result will follow. However $\theta$ is a parameter of the algorithm. We therefore have to make
additional updates so that $w^\phi \cdot \phi(z) \approx \theta + (|F^{-1}(1)| - K)$ so that $w^\phi \cdot \phi(z) \ge \theta$ if and
only if $|F^{-1}(1)| \ge K$. The details are somewhat involved since we must track the resolution
of approximations of the different values so that the final inner product will indeed give a
correct result with respect to the threshold.

General setup of the construction. In more detail, let 

• $U = n + 1 + \lceil (\lceil \log_\alpha 4 \rceil + 1) \log \alpha \rceil$,

• $V = \lceil \frac{n+1}{\log \alpha} \rceil + 1$,

and let m be defined as

$m = n + U + 6Vn^2 + 6UW + 3$. (2)

Since $\alpha \ge 1 + 1/m^{1-\epsilon}$, using the fact that $\log(1 + x) \ge x/2$ for $0 < x < 1$ we have that
$\log \alpha \ge 1/(2m^{1-\epsilon})$, and from this it easily follows that m as specified above is polynomial in
n. We describe a polynomial time transformation which maps an n-variable instance (F, K)
of M2SAT to an m-variable instance (S, z) of KWP$(\alpha, \theta)$ where $S = (x^1, b_1), \ldots, (x^t, b_t)$
is monotone consistent, each $x^i$ and z belong to $\{0, 1\}^m$, and $w^\phi \cdot \phi(z) \ge \theta$ if and only if
$|F^{-1}(1)| \ge K$.

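The Winnow variables $x_1, \ldots, x_m$ are divided into three sets A, B and C where $A =
\{x_1, \ldots, x_n\}$, $B = \{x_{n+1}, \ldots, x_{n+U}\}$ and $C = \{x_{n+U+1}, \ldots, x_m\}$. The unlabeled example z
is $1^{n+U} 0^{m-n-U}$, i.e. all variables in A and B are set to 1 and all variables in C are set to 0.
We thus have $w^\phi \cdot \phi(z) = M_A + M_B + M_{AB}$ where $M_A = \sum_{\emptyset \neq T \subseteq A} w^\phi_T$, $M_B = \sum_{\emptyset \neq T \subseteq B} w^\phi_T$ and
$M_{AB} = \sum_{T \subseteq A \cup B,\, T \cap A \neq \emptyset,\, T \cap B \neq \emptyset} w^\phi_T$. We refer to monomials $\emptyset \neq T \subseteq A$ as type-A monomials,
monomials $\emptyset \neq T \subseteq B$ as type-B monomials, and monomials $T \subseteq A \cup B$, $T \cap A \neq \emptyset$, $T \cap B \neq \emptyset$
as type-AB monomials.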

The example sequence S is divided into four stages. Stage 1 results in $M_A \approx |F^{-1}(1)|$;
as described below the n variables in A correspond to the n variables in the CNF formula
F. Stage 2 results in $M_A \approx \alpha^q |F^{-1}(1)|$ for some positive integer q which we specify later.
Stages 3 and 4 together result in $M_B + M_{AB} \approx \theta - \alpha^q K$. Thus the final value of $w^\phi \cdot \phi(z)$ is
approximately $\theta + \alpha^q (|F^{-1}(1)| - K)$, so we have $w^\phi \cdot \phi(z) \ge \theta$ if and only if $|F^{-1}(1)| \ge K$.

Since all variables in C are 0 in z, if T includes a variable in C then the value of $w^\phi_T$
does not affect $w^\phi \cdot \phi(z)$. The variables in C are "slack variables" which (i) make Winnow
perform the correct promotions/demotions and (ii) ensure that S is monotone consistent.

Stage 1: Setting $M_A \approx |F^{-1}(1)|$. We define the following correspondence between
truth assignments $y^T \in \{0,1\}^n$ and monomials $T \subseteq A$: $y^T_i = 0$ if and only if $x_i$ is not
present in $T$. For each clause $y_{i_1} \vee y_{i_2}$ in $F$, Stage 1 contains $V$ negative examples such that
$x_{i_1} = x_{i_2} = 0$ and $x_i = 1$ for all other $x_i \in A$. We show below that (1) Winnow makes a
false positive prediction on each of these examples and (2) in Stage 1 Winnow never does a
promotion on any example which has any variable in $A$ set to 1. Consider any $y^T$ such that
$F(y^T) = 0$. Since our examples include an example $y^S$ such that $y^T \le y^S$, the monomial $T$
is demoted at least $V$ times. As a result after Stage 1 we will have that for all $T$, $w^\phi_T = 1$
if $F(y^T) = 1$ and $0 < w^\phi_T \le \alpha^{-V}$ if $F(y^T) = 0$. Thus we will have $M_A = |F^{-1}(1)| + \gamma_1$ for
some $0 < \gamma_1 < 2^n\alpha^{-V} < \frac{1}{2}$.
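The effect of the Stage 1 demotions on $M_A$ can be checked by brute force for small $n$. The following is a minimal sketch, not the construction itself: it assumes that every Stage 1 negative example does trigger a demotion (which the slack variables are there to guarantee), it tracks only the type-A monomials, and the function names and the hand-picked value of `V` are illustrative choices rather than anything taken from the paper.

```python
from fractions import Fraction
from itertools import combinations, product

def stage1_weights(n, clauses, alpha, V):
    """Weights of all nonempty monomials T over x_1..x_n after the Stage 1 demotions.

    clauses: list of pairs (i1, i2), 1-indexed, for the monotone 2-CNF F.
    Assumption: each of the V negative examples per clause causes a demotion.
    """
    variables = range(1, n + 1)
    monomials = [frozenset(c) for r in range(1, n + 1)
                 for c in combinations(variables, r)]
    w = {T: Fraction(1) for T in monomials}
    for i1, i2 in clauses:
        ones = set(variables) - {i1, i2}   # bits of A set to 1 in the negative example
        for _ in range(V):
            for T in monomials:
                if T <= ones:              # T is satisfied by the example, so it is demoted
                    w[T] /= alpha
    return w

def count_satisfying(n, clauses):
    """|F^{-1}(1)| by exhaustive enumeration."""
    return sum(all(y[i1 - 1] or y[i2 - 1] for i1, i2 in clauses)
               for y in product([0, 1], repeat=n))

n, clauses, alpha, V = 4, [(1, 2), (3, 4)], Fraction(2), 6
M_A = sum(stage1_weights(n, clauses, alpha, V).values())
gamma_1 = M_A - count_satisfying(n, clauses)
assert 0 < gamma_1 < 2**n * alpha**(-V) < Fraction(1, 2)
print(M_A, gamma_1)  # 291/32 3/32
```

Here six monomials correspond to falsifying assignments, each ends at weight $2^{-6}$, and the nine remaining monomials keep weight 1, so $M_A = 9 + 3/32$.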

We now show how the Stage 1 examples cause Winnow to make a false positive prediction
on negative examples which have $x_{i_1} = x_{i_2} = 0$ and $x_i = 1$ for all other $x_i$ in $A$ as described
above. For each such negative example in Stage 1 six new slack variables $x_{\beta+1}, \ldots, x_{\beta+6} \in C$
are used as follows: Stage 1 has $\lceil\log_\alpha(\theta/3)\rceil$ repeated instances of the positive example which
has $x_{\beta+1} = x_{\beta+2} = 1$ and all other bits 0. These examples cause promotions which result
in $\theta \le w^\phi_{x_{\beta+1}} + w^\phi_{x_{\beta+2}} + w^\phi_{x_{\beta+1}x_{\beta+2}} < \alpha\theta$ and hence $w^\phi_{x_{\beta+1}} \ge \theta/3$. Two other groups of
similar examples (the first with $x_{\beta+3} = x_{\beta+4} = 1$, the second with $x_{\beta+5} = x_{\beta+6} = 1$) cause
$w^\phi_{x_{\beta+3}} \ge \theta/3$ and $w^\phi_{x_{\beta+5}} \ge \theta/3$. The next example in $S$ is the negative example which has
$x_{i_1} = x_{i_2} = 0$, $x_i = 1$ for all other $x_i$ in $A$, $x_{\beta+1} = x_{\beta+3} = x_{\beta+5} = 1$ and all other bits 0.
For this example $w^\phi \cdot \phi(x) > w^\phi_{x_{\beta+1}} + w^\phi_{x_{\beta+3}} + w^\phi_{x_{\beta+5}} \ge \theta$ so Winnow makes a false positive
prediction.
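The promotion count for one pair of slack variables is easy to sanity-check numerically. The sketch below is an illustration under simplifying assumptions: it tracks only the three monomials satisfied by the positive example (the two slack variables and their conjunction), all starting at weight 1, and the name `promote_slack_pair` and the sample values of $\alpha$ and $\theta$ are hypothetical.

```python
from fractions import Fraction

def promote_slack_pair(alpha, theta):
    """Repeat the positive example with exactly two slack bits set until Winnow
    stops predicting negative on it.

    Returns (number of promotions, common final weight of the three monomials).
    """
    w = Fraction(1)        # the three satisfied monomials keep equal weights
    promotions = 0
    while 3 * w < theta:   # prediction is negative on a positive example: promote
        w *= alpha
        promotions += 1
    return promotions, w

alpha, theta = Fraction(3, 2), Fraction(1000)
k, w = promote_slack_pair(alpha, theta)
assert theta <= 3 * w < alpha * theta   # the sum lands in [theta, alpha * theta)
assert w >= theta / 3                   # so each single weight is at least theta / 3
print(k)  # 15
```

With these sample values the loop stops after 15 promotions, which matches $\lceil\log_\alpha(\theta/3)\rceil$ for $\alpha = 3/2$ and $\theta = 1000$.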

Since $F$ has at most $n^2$ clauses and there are $V$ negative examples per clause, this
construction can be carried out using $6Vn^2$ slack variables $x_{n+U+1}, \ldots, x_{n+U+6Vn^2}$. We
thus have (1) and (2) as claimed above.

Stage 2: Setting $M_A \approx \alpha^q|F^{-1}(1)|$. The first Stage 2 example is a positive example
with $x_i = 1$ for all $x_i \in A$, $x_{n+U+6Vn^2+1} = 1$ and all other bits 0. Since each of the $2^n$
monomials which contain $x_{n+U+6Vn^2+1}$ and are satisfied by this example have $w^\phi_T = 1$,
we have $w^\phi \cdot \phi(x) = 2^n + |F^{-1}(1)| + \gamma_1 < 2^{n+1}$. Since $\theta \ge 2^m/\mathrm{poly}(m) > 2^{n+1}$ (recall
from equation (2) that $m > 6n^2$), after the resulting promotion we have
$w^\phi \cdot \phi(x) = \alpha(2^n + |F^{-1}(1)| + \gamma_1) < \alpha 2^{n+1}$. Let

$$q = \lceil\log_\alpha(\theta/2^{n+1})\rceil - 1$$

so that

$$\alpha^q 2^{n+1} < \theta \le \alpha^{q+1} 2^{n+1}. \tag{3}$$

Stage 2 consists of $q$ repeated instances of the positive example described above. After
these promotions we have $w^\phi \cdot \phi(x) = \alpha^q(2^n + |F^{-1}(1)| + \gamma_1) < \alpha^q 2^{n+1} < \theta$. Since
$1 < |F^{-1}(1)| + \gamma_1 < 2^n$ we also have

$$\alpha^q < M_A = \alpha^q(|F^{-1}(1)| + \gamma_1) < \alpha^q 2^n < \theta/2. \tag{4}$$

Equation (4) gives the value which Ma will have throughout the rest of the argument. 
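Equations (3) and (4) can be verified with exact arithmetic for concrete parameters. The following sketch assumes $\theta > 2^{n+1}$ (as the text guarantees) and computes $q$ by direct search instead of a floating-point logarithm; the function name and the sample values of $\alpha$, $\theta$, $n$, $|F^{-1}(1)|$ and $\gamma_1$ are illustrative only.

```python
from fractions import Fraction

def stage2_exponent(alpha, theta, n):
    """Largest q with alpha^q * 2^(n+1) < theta, which equals
    ceil(log_alpha(theta / 2^(n+1))) - 1 whenever theta > 2^(n+1)."""
    assert theta > 2 ** (n + 1)
    q, power = 0, Fraction(1)
    while power * alpha * 2 ** (n + 1) < theta:
        power *= alpha
        q += 1
    return q

alpha, theta, n = Fraction(3, 2), Fraction(2 ** 20), 4
q = stage2_exponent(alpha, theta, n)

# Equation (3)
assert alpha ** q * 2 ** (n + 1) < theta <= alpha ** (q + 1) * 2 ** (n + 1)

# Equation (4), with sample values for |F^{-1}(1)| and gamma_1
num_sat, gamma_1 = 9, Fraction(3, 32)
M_A = alpha ** q * (num_sat + gamma_1)
assert alpha ** q < M_A < alpha ** q * 2 ** n < theta / 2
print(q)  # 25
```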

Some Calculations for Stages 3 and 4. At the start of Stage 3 each type-B and type-AB
monomial $T$ has $w^\phi_T = 1$. There are $n$ variables in $A$ and $U$ variables in $B$ so at the
start of Stage 3 we have $M_B = 2^U - 1$ and $M_{AB} = (2^n - 1)(2^U - 1)$. Since no example in
Stages 3 or 4 satisfies any $x_i$ in $A$, at the end of Stage 4 $M_A$ will still be $\alpha^q(|F^{-1}(1)| + \gamma_1)$
and $M_{AB}$ will still be $(2^n - 1)(2^U - 1)$. Therefore at the end of Stage 4 we have

$$w^\phi \cdot \phi(z) = M_B + \alpha^q(|F^{-1}(1)| + \gamma_1) + (2^n - 1)(2^U - 1).$$

To simplify notation let

$$D = \theta - (2^n - 1)(2^U - 1) - \alpha^q K.$$

Ideally at the end of Stage 4 the value of $M_B$ would be $D - \alpha^q\gamma_1$ since this would imply that
$w^\phi \cdot \phi(z) = \theta + \alpha^q(|F^{-1}(1)| - K)$ which is at least $\theta$ if and only if $|F^{-1}(1)| \ge K$. However it
is not necessary for $M_B$ to assume this exact value, since $|F^{-1}(1)|$ must be an integer and
$0 < \gamma_1 < \frac{1}{2}$. As long as

$$D \le M_B < D + \tfrac{1}{2}\alpha^q \tag{5}$$

we get that

$$\theta + \alpha^q(|F^{-1}(1)| - K + \gamma_1) \le w^\phi \cdot \phi(z) < \theta + \alpha^q(|F^{-1}(1)| - K + \gamma_1 + \tfrac{1}{2}).$$

Now if $|F^{-1}(1)| \ge K$ we clearly have $w^\phi \cdot \phi(z) \ge \theta$. On the other hand if $|F^{-1}(1)| < K$
then since $|F^{-1}(1)|$ is an integer value $|F^{-1}(1)| \le K - 1$ and we get $w^\phi \cdot \phi(z) < \theta$. Therefore
all that remains is to construct the examples in Stages 3 and 4 so that $M_B$ satisfies
Equation (5).

We next calculate an appropriate granularity for $D$. Note that $K \le 2^n$, so by Equation (3)
we have that $\theta - \alpha^q K > \theta/2$. Now recall from Equations (2) and (1) that
$m > n + U + 6n^2$ and $\theta \ge 2^m/\mathrm{poly}(m)$, so $\theta/2 \ge 2^{n+U+6n^2}/\mathrm{poly}(m) \gg 2^n 2^U$. Consequently we
certainly have that $D > \theta/4$, and from Equation (3) we have that $D > \theta/4 > \alpha^q 2^{n-1} > \frac{1}{4}\alpha^q$.
Let

$$c = \lceil\log_\alpha 4\rceil,$$

so that we have

$$\alpha^{q-c} \le \tfrac{1}{4}\alpha^q < D. \tag{6}$$

There is a unique smallest positive integer $p > 1$ which satisfies $D \le p\alpha^{q-c} < D + \frac{1}{4}\alpha^q$. The
Stage 3 examples will result in $M_B$ satisfying $p < M_B < p + \frac{1}{4}$. We now have that:

$$\alpha^{q-c} < D \le p\alpha^{q-c} < D + \tfrac{1}{4}\alpha^q$$

$$\le \theta - \tfrac{3}{4}\alpha^q \tag{7}$$

$$\le \alpha^{q+1}2^{n+1} - 3\alpha^{q-c} \tag{8}$$

$$= \alpha^{q-c} \cdot (\alpha^{c+1}2^{n+1} - 3). \tag{9}$$

Here (7) holds since $K \ge 1$, and thus (by definition of $D$) we have $D + \alpha^q \le \theta$ which is
equivalent to Equation (7). Inequality (8) follows from Equations (6) and (3).
Hence we have that

$$1 < p < \alpha^{c+1}2^{n+1} - 3 \le 2^{n+1+\lceil(c+1)\log\alpha\rceil} - 3 = 2^U - 3, \tag{10}$$

where the second inequality in the above chain follows from Equation (9). We now use the
following lemma:

Lemma 10 For all $\ell \ge 1$, for all $1 \le p \le 2^\ell - 1$, there is a monotone CNF $F_{\ell,p}$ over
$\ell$ Boolean variables which has at most $\ell$ clauses, has exactly $p$ satisfying assignments in
$\{0,1\}^\ell$, and can be constructed from $\ell$ and $p$ in $\mathrm{poly}(\ell)$ time.

Proof: The proof is by induction on $\ell$. For the base case $\ell = 1$ we have $p = 1$ and $F_{\ell,p} = x_1$.
Assuming the lemma is true for $\ell = 1, \ldots, k$ we now prove it for $\ell = k + 1$:

If $1 \le p \le 2^k - 1$ then the desired CNF is $F_{k+1,p} = x_{k+1} \wedge F_{k,p}$. Since $F_{k,p}$ has at most $k$
clauses $F_{k+1,p}$ has at most $k + 1$ clauses. If $2^k + 1 \le p \le 2^{k+1} - 1$ then the desired CNF is
$F_{k+1,p} = x_{k+1} \vee F_{k,p-2^k}$. By distributing $x_{k+1}$ over each clause of $F_{k,p-2^k}$ we can write $F_{k+1,p}$
as a CNF with at most $k$ clauses. If $p = 2^k$ then $F_{k+1,p} = x_1$. □
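The induction in the proof translates directly into a recursive procedure. The sketch below is one possible rendering, assuming a CNF is represented as a list of clauses and each clause as a frozenset of 1-indexed variables; the names `build_cnf` and `count_sat` are illustrative, and the brute-force counter is exponential and only meant for checking small cases.

```python
from itertools import product

def build_cnf(l, p):
    """Monotone CNF over x_1..x_l with exactly p satisfying assignments,
    following the three cases of the inductive proof."""
    assert l >= 1 and 1 <= p <= 2 ** l - 1
    if l == 1:
        return [frozenset([1])]                        # base case: F = x_1
    k = l - 1
    if p <= 2 ** k - 1:
        return build_cnf(k, p) + [frozenset([l])]      # x_{k+1} AND F_{k,p}
    if p == 2 ** k:
        return [frozenset([1])]                        # the single clause x_1
    # x_{k+1} OR F_{k,p-2^k}, with x_{k+1} distributed into every clause
    return [clause | {l} for clause in build_cnf(k, p - 2 ** k)]

def count_sat(l, clauses):
    """Number of assignments in {0,1}^l satisfying every clause (brute force)."""
    return sum(all(any(a[v - 1] for v in clause) for clause in clauses)
               for a in product([0, 1], repeat=l))

# Exhaustive check of the lemma's two claims for small l.
for l in range(1, 7):
    for p in range(1, 2 ** l):
        F = build_cnf(l, p)
        assert len(F) <= l
        assert count_sat(l, F) == p
```

The recursion depth is at most $\ell$ and each level does $O(\ell)$ work on at most $\ell$ clauses, which is consistent with the $\mathrm{poly}(\ell)$ construction time claimed in the lemma.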

Stage 3: Setting $M_B \approx p$. Let $F_{U,p}$ be an $r$-clause monotone CNF formula over the
$U$ variables in $B$ which has $p$ satisfying assignments. Similar to Stage 1, for each clause
of $F_{U,p}$, Stage 3 has $W$ negative examples corresponding to that clause, and as in Stage
1 slack variables in $C$ are used to ensure that Winnow makes a false positive prediction
on each such negative example. Thus the examples in Stage 3 cause $M_B = p + \gamma_2$ where
$0 < \gamma_2 < 2^U\alpha^{-W} < \frac{1}{4}$. Since six slack variables in $C$ are used for each negative example
and there are $rW \le UW$ negative examples, the slack variables $x_{n+U+6Vn^2+2}, \ldots, x_{m-2}$ are
sufficient for Stage 3.

Stage 4: Setting $M_B + M_{AB} \approx \theta - \alpha^q K$. All that remains is to perform $q - c$
promotions on examples which have each $x_i$ in $B$ set to 1. This will cause $M_B$ to equal
$(p + \gamma_2)\alpha^{q-c}$. By the inequalities established above, this will give us

$$D \le p\alpha^{q-c} < (p + \gamma_2)\alpha^{q-c} = M_B < D + \tfrac{1}{4}\alpha^q + \gamma_2\alpha^{q-c} < D + \tfrac{1}{2}\alpha^q$$

which is as desired.

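In order to guarantee $q - c$ promotions we use two sequences of examples of length
$q - \lceil\frac{U-n}{\log\alpha}\rceil$ and $\lceil\frac{U-n}{\log\alpha}\rceil - c$ respectively. We first show that these are positive numbers. It
follows directly from the definitions $U = n + 1 + \lceil(\lceil\log_\alpha 4\rceil + 1)\log\alpha\rceil$ and $c = \lceil\log_\alpha 4\rceil$
that $\frac{U-n}{\log\alpha} > c$. Since $\theta > 2^{6n^2}$ (by definition of $m$ and Equation (1)) and $\alpha$ is bounded
by a polynomial in $m$, we clearly have that $\log(\theta/2^{n+1}) > U - n + \log(\alpha)$. Now since
$q = \lceil\log_\alpha(\theta/2^{n+1})\rceil - 1$ this implies that $q \ge \frac{\log(\theta/2^{n+1})}{\log\alpha} - 1 > \lceil\frac{U-n}{\log\alpha}\rceil$, so that $q - \lceil\frac{U-n}{\log\alpha}\rceil > 0$.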

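The first $q - \lceil\frac{U-n}{\log\alpha}\rceil$ examples in Stage 4 are all the same positive example which has
each $x_i$ in $B$ set to 1 and $x_{m-1} = 1$. The first time this example is received, we have
$w^\phi \cdot \phi(x) = 2^U + p + \gamma_2 < 2^{U+1}$. Since $\theta > 2^{6n^2}$, by inspection of $U$ we have $2^{U+1} < \theta$, so
Winnow performs a promotion. Similarly, after $q - \lceil\frac{U-n}{\log\alpha}\rceil$ occurrences of this example, we
have

$$w^\phi \cdot \phi(x) = \alpha^{q-\lceil\frac{U-n}{\log\alpha}\rceil}(2^U + p + \gamma_2) < \alpha^{q-\lceil\frac{U-n}{\log\alpha}\rceil}2^{U+1} \le \alpha^q 2^{n+1} < \theta$$

so promotions are indeed performed at each occurrence, and

$$M_B = \alpha^{q-\lceil\frac{U-n}{\log\alpha}\rceil}(p + \gamma_2).$$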

The remaining examples in Stage 4 are $\lceil\frac{U-n}{\log\alpha}\rceil - c$ repetitions of the positive example $x$
which has each $x_i$ in $B$ set to 1 and $x_m = 1$. If promotions occurred on each repetition of
this example then we would have $w^\phi \cdot \phi(x) = \alpha^{\lceil\frac{U-n}{\log\alpha}\rceil-c}(2^U + \alpha^{q-\lceil\frac{U-n}{\log\alpha}\rceil}(p + \gamma_2))$, so we need
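only show that this quantity is less than $\theta$. We reexpress this quantity as
$\alpha^{\lceil\frac{U-n}{\log\alpha}\rceil-c}2^U + \alpha^{q-c}(p + \gamma_2)$. We have

$$\alpha^{q-c}(p + \gamma_2) \le \theta - \tfrac{3}{4}\alpha^q + \tfrac{1}{16}\alpha^q \tag{11}$$

$$< \theta - \tfrac{1}{2}\alpha^q$$

where (11) follows from (7) and the definition of $c$. Finally, we have that
$\alpha^{\lceil\frac{U-n}{\log\alpha}\rceil-c}2^U \le \alpha \cdot 2^{2U-n-c\log\alpha} \le \alpha \cdot 2^{2U-n-2} < \frac{\theta}{\alpha 2^{n+2}} \le \frac{1}{2}\alpha^q$, where the last inequality is by Equation (3)
and the previous inequality is by inspection of the values of $\alpha$, $\theta$ and $U$. Combining the two
bounds above we see that indeed $w^\phi \cdot \phi(x) < \theta$.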

Finally, we observe that by construction the example sequence $S$ is monotone consistent.
Since $m = \mathrm{poly}(n)$ and $S$ contains $\mathrm{poly}(n)$ examples the transformation from M2SAT to
KWP($\alpha, \theta$) is polynomial-time computable and the theorem is proved. □ (Theorem 9)



5. Conclusion 

Linear threshold functions are a weak representation language for which we have
interesting learning algorithms. Therefore, if linear learning algorithms are to learn expressive
functions, it is necessary to expand the feature space over which they are applied. This
work explores the tradeoff between computational efficiency and convergence when using
expanded feature spaces that capture conjunctions of base features.

We have shown that while each iteration of the kernel Perceptron algorithm can be 
executed efficiently, the algorithm can provably require exponentially many updates even 
when learning a function as simple as $f(x) = x_1 x_2 \cdots x_n$. On the other hand, the kernel
Winnow algorithm has a polynomial mistake bound for learning polynomial-size monotone 
DNF, but our results show that under a widely accepted computational hardness assumption 
it is impossible to efficiently simulate the execution of kernel Winnow. The latter also implies 
that there is no general construction that will run Winnow using kernel functions. 

Our results indicate that additive and multiplicative update algorithms lie on opposite 
extremes of the tradeoff between computational efficiency and convergence; we believe that 
this fact could have significant practical implications. By demonstrating the provable lim- 
itations of using kernel functions which correspond to high-degree feature expansions, our 
results also lend theoretical justification to the common practice of using a small degree in 
similar feature expansions such as the well-known polynomial kernel.²

2. Our Boolean kernels are different from standard polynomial kernels in that all the conjunctions are
weighted equally, and also in that we allow negations.

Since the publication of the initial conference version of this work (Khardon, Roth, &
Servedio, 2002), several authors have explored closely related ideas. One can show that our
construction for the negative results for Perceptron does not extend (either in the PAC or
online setting) to related algorithms such as Support Vector Machines which work by
constructing a maximum margin hypothesis consistent with the examples. The paper (Khardon
& Servedio, 2003) gives an analysis of the PAC learning performance of maximum margin
algorithms with the monotone monomials kernel, and derives several negative results, thus
giving further negative evidence for the monomial kernel. In the paper (Cumby & Roth,
2003) a kernel for expressions in description logic (generalizing the monomials kernel) is
developed and successfully applied for natural language and molecular problems.
Takimoto and Warmuth (2003) study the use of multiplicative update algorithms other than
Winnow (such as weighted majority) and obtain some positive results by restricting the
type of loss function used to be additive over base features. Chawla et al. (2004) have
studied Monte Carlo estimation approaches to approximately simulate the Winnow
algorithm's performance when run over a space of exponentially many features. The use of
kernel methods for logic learning and developing alternative methods for feature expansion
with multiplicative update algorithms remain interesting and challenging problems to be
investigated.